Treebanks in Machine Translation

Authors

  • Martin Čmejrek
  • Jan Cuřín
  • Jiří Havelka

Abstract

We present an approach to using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: the Prague Dependency Treebank, a newly created Czech-English parallel corpus of the Penn Treebank, an English monolingual corpus, and translation lexicons. The fully automatic process includes analysis of the Czech input into a tectogrammatical (semantic) representation, lexical and structural transfer, a simple rule-based system for generating the English surface realization, and an n-gram language model for scoring and choosing among translation hypotheses. The results are evaluated quantitatively with the BLEU score.
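
The final step of the pipeline, scoring candidate translations with an n-gram language model and keeping the best one, can be illustrated with a minimal sketch. The abstract does not state the n-gram order, the smoothing method, or the training data, so everything below is an assumption made for illustration: a bigram model with add-one smoothing, a three-sentence toy corpus, and the hypothetical helper names `train_bigram_lm`, `log_prob`, and `best_hypothesis`.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(hypothesis, unigrams, bigrams):
    """Return the add-one smoothed bigram log-probability of a hypothesis."""
    # Assumption: one extra vocabulary slot stands in for all unseen words.
    vocab_size = len(unigrams) + 1
    tokens = ["<s>"] + hypothesis.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


def best_hypothesis(hypotheses, unigrams, bigrams):
    """Pick the hypothesis with the highest language-model score."""
    return max(hypotheses, key=lambda h: log_prob(h, unigrams, bigrams))


# Toy monolingual English corpus (invented for this example).
corpus = [
    "the court rejected the appeal",
    "the company rejected the offer",
    "the appeal was rejected",
]
unigrams, bigrams = train_bigram_lm(corpus)

# Two candidate outputs with the same words in different orders.
hypotheses = [
    "the court rejected the appeal",
    "court the appeal rejected the",
]
print(best_hypothesis(hypotheses, unigrams, bigrams))
```

The sketch compares raw log-probabilities, which favors shorter hypotheses when candidates differ in length; a real system would typically normalize by length or combine the language-model score with other features.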

Related resources

Automatic Generation of Parallel Treebanks

The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce a novel platform for fast and robust automatic generation of paral...

Large aligned treebanks for syntax-based machine translation

We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we ...

Morphologically and Syntactically Annotated Corpora of Many Languages

Annotated corpora have become a standard resource for research in both linguistics and computational processing of natural languages. Lexicographers judge word usage and distribution by occurrences in corpora; part-of-speech tags may help them narrow their queries. Grammarians may use syntactically annotated corpora (treebanks) for queries such as “show me all examples where a verb governs two ...

Unsupervised Generation of Parallel Treebanks through Sub-Tree Alignment

The need for syntactically annotated data for use in natural language processing has increased dramatically in recent years. This is true especially for parallel treebanks, of which very few exist. The ones that exist are mainly hand-crafted and too small for reliable use in data-oriented applications. In this paper we introduce an open-source system for fast and robust automatic generation of para...

Publication date: 2003